Teaching computers to fold proteins 
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A new general algorithm for optimization of potential functions for protein folding is introduced. 
It is based upon gradient optimization of the thermodynamic stability of native folds of a training set 
of proteins with known structure. The iterative update rule contains two thermodynamic averages 
which are estimated by (generalized ensemble) Monte Carlo. We test the learning algorithm on 
a Lennard- Jones (LJ) force field with a torsional angle degrees-of- freedom and a single-atom side- 
chain. In a test with 24 peptides of known structure, none folded correctly with the initial potential 
functions, but two-thirds came within 3A to their native fold after optimizing the potential functions. 

PACS numbers: 05.10.-a,07.05.Mh,87.15.Aa,87.15.Cc 



It is one of the long-standing challenges of science to 
simulate protein folding in a computer and predict the 
three-dimensional structure - the native fold. According 
to Anfmsen's hypothesis the native fold of a protein is the 
one with the lowest free energy 1]. To fold a protein in 
silico, it is therefore necessary to have a sufficiently good 
description of the energetics of the system. Even the most 
sophisticated all-atom potentials [2|,LJ and statistical po- 
tential functions 0, 0, Q will not usually give stability of 
an experimentally determined native structure. Further- 
more, these potential functions have so many degrees of 
freedom that nano-second time-scale molecular dynam- 
ics simulations require of the order of months on even the 
fastest computers. To sample the state space of a protein 
in solution with present-day computers it is therefore nec- 
essary to use a simplified description of the protein and 
the solvent rather than an all-atom model. It is virtually 
impossible to calculate such potential functions from first 
principles. 

In this paper we describe a method to estimate 
parametrized potential functions from a training set of 
known protein structures. Most previous work on esti- 
mation of potentials use statistical approaches 0, H, > 
which are based on static structures. The main new fea- 
ture in our approach is that we optimize the potentials 
during simulation of the folding process, so as to maxi- 
mize the thermodynamic probability of the native folds 
of the whole training set. This maximum likelihood esti- 
mation procedure, which is essentially Boltzmann learn- 
ing [3, can be thought of as iteratively stabilizing the 
native structure on the one hand and 'unlearn' incor- 
rect folds, which traps the protein during folding, on the 
other. There exists other approaches that are similar 
in spirit, but none which aim directly at optimizing the 
thermodynamic stability. Rather, related measures are 
optimized such as the normalized difference between the 
native energy and the average energy over all alternative 
conformations || E| ^} , the thermodynamic average of 
the overlap to the native state in a contact energy model 
[HI Hil and linear optimization methods for ensuring 
that the native state has lowest free energy 

HE3. The 



overlap method is the one closest related to optimizing 
the thermodynamic stability. 

In the general setup we have a parameterized energy 
function ^(R, seq) with parameters 0, which give the 
energy for an amino acid sequence seq with atomic co- 
ordinates R. The probability of finding the ith train- 
ing sequence in its native state is given by the Boltz- 
mann weighted volume of conformation space compatible 
with the native structure divided by the total Boltzmann 
weighted volume of conformation space 



P(nati|seq i , 0) = 



Jnati exp(-/?£fl(R, seqj) dR 
J exp(-/?£fl(R, seqj) dK ' 



(1) 



where /3 = 1/kT and the integral in the numerator is 
only over the part of conformation space associated with 
the native structure. The definition and choice of the 
size of the native volume in conformation space should 
reflect all expected variability such as the loss of descrip- 
tion accuracy due to the crudeness of the protein model, 
thermal variability of the native state and the uncertainty 
in the determination of the crystal/NMR structure. In 
this study we define the native volume as all structures 
within a C a root mean square deviation (RMSD) of 1A 
from the crystal structure. 

The objective is to maximize the joint probability 
YiiLi -P(nat^|seq^, 0) with respect to the parameters 
where N is the number of sequences in the training set. 
We choose to perform the maximization by gradient as- 
cent <9 new := old + A0, where 



AO = rjVe^2lnP(na,t i \seq i ,0) 



(2) 



= [<V^(R,seqJ) - ((V^(R, seq,)) nat J 

with 77 being the learning rate, (...) and (• • -) n at- de- 
noting Boltzmann averages over the total conformation 
space and the part associated with the native structure, 
respectively. In neural computation context this is known 
as the Boltzmann learning rule Q . In simulations we per- 
form the Boltzmann averages by a (generalized ensemble) 
Monte Carlo method. 
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The above learning rule applies to any differentiate 
potential function. The aim is to estimate a potential 
that gives a high probability to the correct fold under 
the given protein model and with the chosen simulation 
method, which of course does not guarantee that the po- 
tential is close to the real physics. To demonstrate the 
validity of the method we have applied it to a simple force 

Ri 

field. The amino acid unit model has 6 atoms —NC a C—, 

H O 

where O and H is introduced to be able to define back- 
bone hydrogen bonds and to give a more realistic local 
torsion potential. The whole side chain is represented by 
Ri, so the parameters relating to this is amino acid de- 
pendent, whereas the other atoms are treated more con- 
ventionally. The conformational degrees of freedom are 
the torsional angles <\> and i/j (rotation around the NC a 
bond and C a C bond). The C a Ri distance is adaptive 
(one parameter for each of the 20 amino acids), whereas 
all other bond lengths and ang les are fixed to their aver- 
age value as given in Ref. 16]. The angle to Ri is fixed 
to the average value for C a Cp. 

The energy is a generalization of the one proposed in 
Ref. 17] which in turn is inspired by the classical force 
fields. The energy is split into local and non-local terms 
E = i£iocai + -^non-local- The local interactions are mainly 
introduced to model local steric constraints and the non- 
local consists of three types of terms introduced to model 
pairwise interactions, hydrophobic, surface and related 
effects (hp) and hydrogen bonding (hb): ^non-local = 

^hp + ^hb • 

The local energy contains two types of terms 

^local = f Ei(l + COS %) + ^Ei(l + COs3^i) + 

12 / VV \Q\ 

The sums on i 
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are over the individual amino acids. The first two terms - 
which are probably not so important - introduce a three 
fold symmetry and has two parameters independent of 
the side chains. The last term is mainly introduced to 
model steric constraints, and the sum over X, Y runs over 
a restricted set of pairs of local atoms in amino acid z, 
i + 1 and i + 2 (with terms set to zero when i + 1 or 
i + 2 are larger than the length of the protein). The 
following atom pairs are included: H(i)R(i), R(i)0(i), 
H(i)H(i + 1), 0(i)H(i + 2) and 0(i)0(i + 1). Only the 
first two pairs depend on the amino acid side chain, so 
this introduces a total of 2*(20+20+2+2+2)=92 param- 
eters. 

Hydrogen bonding is only considered for back- 
bone NH and CO pairs. The hydrogen bond en- 
ergy contains both angle dependence and a 12-10 
Lennard- Jones potential with two adjustable parameters 



£hb = ^6E»i^i|(^r)-2(^fr) J, where 
cos 2 ^°cos 2 ^ oc/ for tt/2 < 0™ 6^ oc ' < 3^/2 



and O(j), 0§ HO is the N(i)H(i)0(j) angle and 6%°°' 
the H(i)0(j)C(j) angle. 

The hydrophobic interaction Eh p consists of two 
types of terms. The first one is a pure radial 12- 
6 Lennard Jones potential that should take into ac- 
count all non-local, hydrophobic and other forces be- 
tween the ith amino acid ai and the jth dj. Both C a C a 
and RR interactions are included. The C a C a interac- 

tion, i.e. Ei>j ^fa.a,-) js (^f^f- 6 (^f^J], 

where rg- is the C a (i)C a (j) distance, is mainly in- 
troduced to model steric constraints. The sec- 
ond type of term, which plays a minor role, 
is a surface energy term inspired by Refs. ^| 
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where 



and zero otherwise, 



ij 



HO 



is the distance between H(i) 



f{x) = 0.5(1 + tanh(x)) G [0,1] is a sigmoid non- 
linearity. The surface energy term is chosen such that 
e surf _ £ surf^ a ^ ^ g energy change induced by taking 
side chain a$ from being completely exposed to solvent 
to being completely buried. The inner sum counts the 
number of neighboring amino- acids and the adjustable 
parameters Si, hi, <j^ urf '°, <rf urf set the relevant scale and 
bias for burial of each amino acid. The total number 
of adjustable parameters for the hydrophobic energy is 
2*2*(20*19/2+20)=840 for Lennard Jones and 5*20=100 
for surface. 

The temperature scale is arbitrary since the Boltzmann 
weight only depends upon the product j3E and the scale 
of E is set during training. In the test below we choose 
the folding 'temperature' l//?foid = 0.1 for all training 
examples and set the initial parameter values to be on 
the same scale. The initial values are chosen such that 
all amino- acids have the same parameter values, except 
for different surface energy terms (e surf (aa^)). 

To get the learning algorithm eq. © to work properly 
we need reliable estimates of thermodynamic averages. 
To achieve this we use parallel tempering |l9( , where the 
system is simulated independently at a number A^ te mp of 
different temperatures. In this study A^ te mp = 15 rang- 
ing from T m i n = 0.1 to T max = 0.8 with equal spac- 
ing on a log-scale. Once every cycle, where a cycle 
consists of iVconf elementary conformation updates, the 
temperature of two random systems (adjacent in tem- 
perature) are exchanged with the Metropolis probabil- 
ity P(accept) = min(l,exp(A/?A£)) where AE and Af3 
are the energy and inverse temperature difference be- 
tween the two systems respectively. In our case we use 
^conf = 40 of which one quarter are pivot moves (rotation 
around a random torsional angle) and the rest are 'local' 
moves which is defined as choosing two torsional angles 
next to the same peptide bond (i.e. ipi and <^i+i) and ro- 
tate them opposite angles. We collect statistics for 10 4 
cycles between each update of the adjustable parameters. 
We choose the folding temperature to be the minimum 
temperature Tfoid = ^min and thus only use the statistics 
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for this temperature to update the parameters. To en- 
sure sampling of the native conformations, a number of 
the N temp systems are initialized in the native state. The 
remaining systems are initialized in different low energy 
states found in the previous update to ensure fast focus 
on relevant regions of conformation space. Subsequent 
long runs starting from coil confirm that this procedure 
is sufficiently close to generate equilibrated samples. As 
an alternative to the batch update rule eq. 0, one can 
use an online version where the parameters are updated 
using a single training example at a time. In this study, 
we use an intermediate approach, where we update the 
parameters using three batches each with one third of the 
example. The results presented below are obtained using 
approximately 500 parameter updates. The training set 
consists of a small set of 24 protein fragments (or pep- 
tides) of length 11-14 of mainly a-helices and /3-turns 22]. 
They have been suggested to adopt their native structure 
even as fragments |2lj. Running the training on 8 pro- 
cessors on a Silicon Graphics Origin 3000 computer, 500 
parameter updates take approximately 2 CPU weeks. 

We tested the final potential by initializing the 24 
training sequences in random coil. After an initial equi- 
libration, conformations were saved with fixed intervals 
at the folding temperature Tfoid = 0.1 in a long test run. 
These sampled conformations are called decoys below. 
Some results from the test run are shown in Figure 
The decoys are clustered by introducing a RMSD cut- 
off of 0.5 A and assigning as the first cluster center the 
decoy with the most neighbors within the cut-off. The 
clustering is not very sensitive to the specific choice of 
the cut-off. We remove these decoys and repeat the pro- 
cedure until all decoys have been assigned to a cluster. 
The number of decoys in the cluster is directly related 
to the free energy F(clus i) = — TlnP(clus z), where the 
probability P(clus i) of cluster i is the number of decoys 
in cluster i divided by the total number of decoys. The 
decoy plots reveal a complex free energy landscape with 
competing minima. In a few cases free energy minima 
both with small and large RMSD to the native fold ex- 
ist simultaneously. In the subsequent analysis the cluster 
center of the cluster with the lowest free energy was cho- 
sen as the predicted fold. 

The performance on the training set is summarized in 
Figure El The trained potential is compared to the ini- 
tial essentially homo-polymer potential and the results 
of folding with an all-atom potential from 20]. A clear 
improvement over the initial potential is seen, indicating 
that the training process actually works. The results are 
also comparable to the all-atom potential, which shows 
that our approach is comparable to an all-atom poten- 
tial that requires many human expert man hours to de- 
rive. Probing the significance of the folding tempera- 
ture by performing the test run at a lower temperature 
T = 0.025 < Tfoid shows that the overall performance — 
in terms of RMSD for the largest cluster — is significantly 
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FIG. 1: Color online. Decoy plots - energy versus RMSD (A). 
Cluster RMSD cut-off is 0.5 A. Cluster centers are marked 
with a circle. Blue codes for the largest cluster. 
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FIG. 2: The RMSD histogram for the minimum free energy 
conformations for the 24 peptides after training (black), the 
results for an all-atom potential 20] (hatched) and prior to 
training (gray). 



worse. 

To get an understanding of the successes and failures 
of the potential we have visualized low energy structures 
(see fig. 0) and made Ramachandran plots for the amino 
acids. The successful predictions are very native-like 
making the same hydrogen bonds as the native struc- 
ture. However, some of the side chains are very close 
together. Although a side chain should be regarded as 
'effective' with degrees of freedom averaged out and not 
as an atom, the small distance means that the character- 
istic separation in the Lennard Jones potential is small 
and will be sensitive to small changes in the distance 
between the side chains. The structure for some of the 
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FIG. 3: Color online. Representative decoys from the lowest 
free energy cluster for four peptides from top left to right 
(ID, RMSD from native, native structure type): 5CYT 88- 
101, 1.04A, a-helix; 2MHR 67-78, 4.74A, a-helix; 2I1B 69-82, 
4.4A, /3-turn and 2I1B 103-112, 1.63A, /3-turn . 

failures are not 'protein-like' and some of the amino acids 
are not reproducing the Ramachandran behavior found 
for real proteins. 

These findings show that the principle works, however, 
it is clear that the potential function model can be im- 
proved in many ways. One of the great advantages of the 
method is that many terms can be added and if they do 
not work well their weight would end up being very low. 
However, it should be kept in mind that the better the 
starting point, the more likely it is to reach a reasonable 
parameter set. The test suggests that the representation 
of side chains in the potential with just one pseudo-atom 
and a fixed angle is too crude. One remedy is to make 
the side chain model more realistic, e.g. by introducing 
an explicit Cp atom. This would probably make the Ra- 
machandran behavior 'protein-like' and remove some of 
the false minima the model is currently struggling with. 
It is also possible to go in the opposite direction and use a 
more restricted conformational search space, e.g. by only 
sampling experimentally observed Ramachandran angles 
or using an I-Sites library to generate conformations |l8q . 
The two different views are complementary and the re- 
sults of the CASP exercise has shown that it is important 
to pursue both to generate good ab initio predictions PJ| . 

The ultimate goal of optimizing potentials is to obtain 



reasonable predictions for sequences not in the training 
set (generalization). Preliminary runs on such test se- 
quences show poor generalization, which is primarily a 
result of the small training set. It is therefore important 
to now scale up to a more realistic size using more and 
longer sequences. We are currently working on ways to 
speed up the whole process to achieve this goal. 

More details about parameter settings, data sets and 
results can be found at www.imm.dtu.dk/~owi/. 
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